This sample notebook introduces many of the features included in PixieDust. You can find more information about PixieDust at https://pixiedust.github.io/pixiedust/. To ensure you are running the latest version of PixieDust, uncomment and run the following cell. Do not run this cell if you installed PixieDust locally from source and want to continue running it from source.
In [ ]:
#!pip install --user --upgrade pixiedust
In [ ]:
import pixiedust
PixieDust includes a Spark Progress Monitor bar that lets you track the status of your Spark jobs. You can find more info at https://pixiedust.github.io/pixiedust/sparkmonitor.html. Run the following cell to enable the Spark Progress Monitor:
In [ ]:
pixiedust.enableJobMonitor();
You can use the PackageManager component of PixieDust to install and uninstall Maven packages in your notebook kernel without editing configuration files. This component is essential when you run notebooks in a hosted cloud environment and do not have access to the configuration files. You can find more info at https://pixiedust.github.io/pixiedust/packagemanager.html. Run the following cell to install the GraphFrames package. You may need to restart your kernel after installing new packages. Follow the instructions, if any, that appear after running the cell.
In [ ]:
pixiedust.installPackage("graphframes:graphframes:0.1.0-spark1.6")
print("done")
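The string passed to installPackage in the cell above has the familiar Maven coordinate layout of group, artifact and version separated by colons. As a minimal sketch of how such a string breaks down, the plain-Python helper below splits a coordinate into its three parts. The names MavenCoordinate and parse_maven_coordinate are hypothetical and used only for illustration; they are not part of PixieDust.

```python
from collections import namedtuple

# Hypothetical helper for illustration only; it is not part of the PixieDust API.
MavenCoordinate = namedtuple("MavenCoordinate", ["group_id", "artifact_id", "version"])


def parse_maven_coordinate(coordinate):
    """Split a 'group:artifact:version' string into its three parts."""
    parts = coordinate.split(":")
    # Reject strings with the wrong number of fields or with empty fields.
    if len(parts) != 3 or not all(parts):
        raise ValueError("expected 'group:artifact:version', got %r" % coordinate)
    return MavenCoordinate(*parts)


coord = parse_maven_coordinate("graphframes:graphframes:0.1.0-spark1.6")
print(coord.group_id, coord.artifact_id, coord.version)
```

Checking the coordinate format before calling installPackage can make typos easier to spot, since a malformed string fails immediately with a readable message.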
Run the following cell to print out all installed packages:
In [ ]:
pixiedust.printAllPackages()
PixieDust lets you visualize your data in just a few clicks using the display() API. You can find more info at https://pixiedust.github.io/pixiedust/displayapi.html. The following cell creates a DataFrame and uses the display() API to create a bar chart:
In [ ]:
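from pyspark.sql import SQLContext

# sc is assumed to be the SparkContext already provided by the notebook kernel
sqlContext = SQLContext(sc)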
d1 = sqlContext.createDataFrame(
[(2010, 'Camping Equipment', 3),
(2010, 'Golf Equipment', 1),
(2010, 'Mountaineering Equipment', 1),
(2010, 'Outdoor Protection', 2),
(2010, 'Personal Accessories', 2),
(2011, 'Camping Equipment', 4),
(2011, 'Golf Equipment', 5),
(2011, 'Mountaineering Equipment', 2),
(2011, 'Outdoor Protection', 4),
(2011, 'Personal Accessories', 2),
(2012, 'Camping Equipment', 5),
(2012, 'Golf Equipment', 5),
(2012, 'Mountaineering Equipment', 3),
(2012, 'Outdoor Protection', 5),
(2012, 'Personal Accessories', 3),
(2013, 'Camping Equipment', 8),
(2013, 'Golf Equipment', 5),
(2013, 'Mountaineering Equipment', 3),
(2013, 'Outdoor Protection', 8),
(2013, 'Personal Accessories', 4)],
["year","zone","unique_customers"])
display(d1)
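A bar chart of this kind of data generally groups rows by a key column and aggregates a value column. As a rough sketch of that idea, assuming year as the key and a sum over unique_customers (the actual key, value and aggregation are whatever you pick in the chart options), the plain-Python snippet below computes the totals for a few of the rows above without Spark. The helper name sum_by_key is hypothetical.

```python
from collections import OrderedDict

# A small subset of the rows used to build d1 above.
rows = [
    (2010, 'Camping Equipment', 3),
    (2010, 'Golf Equipment', 1),
    (2011, 'Camping Equipment', 4),
    (2011, 'Golf Equipment', 5),
]


def sum_by_key(rows, key_index, value_index):
    """Sum the value column for each distinct entry in the key column."""
    totals = OrderedDict()
    for row in rows:
        key = row[key_index]
        totals[key] = totals.get(key, 0) + row[value_index]
    return totals


# Group by year (column 0) and sum unique_customers (column 2).
totals = sum_by_key(rows, key_index=0, value_index=2)
for year, total in totals.items():
    print(year, total)
# prints:
# 2010 4
# 2011 9
```

Passing key_index=1 instead groups the same rows by the zone column, which mirrors switching the key field of the chart.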
Data scientists working with Spark may occasionally need to call out to one of the hundreds of libraries available on spark-packages.org that are written in Scala or Java. PixieDust addresses this by letting users write and run Scala code directly in its own cell. It also lets variables be shared between Python and Scala in both directions. You can find more info at https://pixiedust.github.io/pixiedust/scalabridge.html.
Start by creating Python variables that we'll use in Scala:
In [ ]:
python_var = "Hello From Python"
python_num = 10
Write Scala code that uses python_var and python_num and creates a new variable that we'll use in Python:
In [ ]:
%%scala
println(python_var)
println(python_num+10)
val __scala_var = "Hello From Scala"
Use __scala_var from Python:
In [ ]:
print(__scala_var)
PixieDust includes a number of sample data sets. You can use these sample data sets to start playing with the display() API and other PixieDust features. You can find more info at https://pixiedust.github.io/pixiedust/loaddata.html. Run the following cell to view the available data sets:
In [ ]:
pixiedust.sampleData()
In [ ]:
pixiedust.installPackage("com.databricks:spark-csv_2.10:1.5.0")
pixiedust.installPackage("org.apache.commons:commons-csv:0")
Run the following cell to get the first data set from the list. This will return a DataFrame and assign it to the variable d2:
In [ ]:
d2 = pixiedust.sampleData(1)
Pass the sample data set (d2) into the display() API:
In [ ]:
display(d2)
You can also download data from a CSV file into a DataFrame which you can use with the display() API:
In [ ]:
d3 = pixiedust.sampleData("https://openobjectstore.mybluemix.net/misc/milliondollarhomes.csv")
PixieDust comes with built-in logging to help you troubleshoot issues. You can find more info at https://pixiedust.github.io/pixiedust/logging.html. To access the log, run the following cell:
In [ ]:
%pixiedustLog -l debug
In [ ]:
%%scala
val __scala_version = util.Properties.versionNumberString
In [ ]:
import platform
print('PYTHON VERSION = ' + platform.python_version())
print('SPARK VERSION = ' + sc.version)
print('SCALA VERSION = ' + __scala_version)
For more information about PixieDust check out the following: